
    Exploring OpenMP Accelerator Model in a real-life scientific application using hybrid CPU-MIC platforms

    Proceedings of: Third International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2016), Sofia (Bulgaria), October 6-7, 2016.
    The main goal of this paper is to assess the suitability of the OpenMP Accelerator Model (OMPAM) for porting a real-life scientific application to heterogeneous platforms containing a single Intel Xeon Phi coprocessor. This OpenMP extension, supported since version 4.0 of the standard, offers a unified directive-based programming model dedicated to massively parallel accelerators. In our study, we focus on applying the OMPAM extension together with OpenMP tasks to a parallel application which implements a numerical model of alloy solidification. To map the application efficiently onto target hybrid platforms using constructs such as omp target, omp target data, and omp target update, we propose a decomposition of the main tasks belonging to the computational core of the studied application. As a result, the coprocessor executes the major parallel workloads, while the CPUs are responsible for the parts of the application that do not require massively parallel resources. Effective overlapping of computations with data transfers is another goal achieved in this way. The proposed approach allows us to execute the whole application 3.5 times faster than the original parallel version running on two CPUs.
    This research was conducted with the support of COST Action IC1305 (NESUS), as well as the National Science Centre (Poland) under grant no. UMO-2011/03/B/ST6/03500. The authors are grateful to the Czestochowa University of Technology for granting access to Intel Xeon Phi coprocessors provided by the MICLAB project no. POIG.02.03.00.24-093/13 (http://miclab.pl).
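
    As a rough illustration of the OMPAM constructs named above, the following C sketch keeps two fields resident on the device with omp target data, offloads the main loop with omp target, and uses omp target update to synchronize a single field back to the host each step. The field names, sizes, and the trivial "solidification" update are hypothetical placeholders, and the tasking the paper uses to overlap transfers with host work is omitted; this is not the authors' implementation.

    /* Hedged sketch: persistent device data region, offloaded kernel, and
     * selective host/device synchronization via target update. Placeholder
     * fields and kernel; requires an OpenMP 4.0+ compiler with offloading. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (512 * 512)
    #define STEPS 10

    int main(void)
    {
        double *temp  = malloc(N * sizeof(double));   /* temperature field (placeholder)    */
        double *phase = malloc(N * sizeof(double));   /* solid-fraction field (placeholder) */
        for (int i = 0; i < N; ++i) { temp[i] = 1700.0; phase[i] = 0.0; }

        /* Keep both fields resident on the accelerator for all time steps. */
        #pragma omp target data map(tofrom: temp[0:N], phase[0:N])
        for (int step = 0; step < STEPS; ++step) {

            /* Major parallel workload runs on the accelerator. */
            #pragma omp target teams distribute parallel for
            for (int i = 0; i < N; ++i) {
                temp[i]  -= 0.01 * temp[i];                   /* cooling (placeholder)        */
                phase[i] += (temp[i] < 1650.0) ? 0.01 : 0.0;  /* solidification (placeholder) */
            }

            /* Copy only the phase field back so the host can handle the part
             * that does not need massively parallel resources (here: output). */
            #pragma omp target update from(phase[0:N])
            printf("step %d: phase[0] = %f\n", step, phase[0]);
        }

        free(temp);
        free(phase);
        return 0;
    }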

    Adaptation of MPDATA Heterogeneous Stencil Computation to Intel Xeon Phi Coprocessor

    The multidimensional positive definite advection transport algorithm (MPDATA) belongs to the group of nonoscillatory forward-in-time algorithms and performs a sequence of stencil computations. MPDATA is one of the major parts of the dynamic core of the EULAG geophysical model. In this work, we outline an approach to adapting the 3D MPDATA algorithm to the Intel MIC architecture. In order to utilize the available computing resources, we propose a (3 + 1)D decomposition of MPDATA heterogeneous stencil computations. This approach is based on a combination of loop tiling and loop fusion techniques, which allows us to alleviate memory/communication bounds and better exploit the theoretical floating-point efficiency of target computing platforms. An important method of improving the efficiency of the (3 + 1)D decomposition is partitioning the available cores/threads into work teams, which reduces inter-cache communication overheads. This method also increases opportunities for efficient distribution of MPDATA computations onto the available resources of the Intel MIC architecture, as well as Intel CPUs. We discuss preliminary performance results obtained on two hybrid platforms, each containing two CPUs and an Intel Xeon Phi coprocessor. The top-of-the-line Intel Xeon Phi 7120P gives the best performance results, executing MPDATA almost 2 times faster than two Intel Xeon E5-2697 v2 CPUs.
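
    The sketch below illustrates, in plain C with OpenMP, the general idea of tiling a 3D stencil along one dimension and fusing two dependent sweeps inside each tile so that intermediate data stays cache-resident. The grid sizes, tile size, and averaging kernels are hypothetical, halo/boundary handling is ignored, and the thread-per-tile parallelization is only a loose analogue of the work teams described above; it is not the MPDATA code.

    /* Hedged sketch of loop tiling plus loop fusion for two dependent 3D
     * stencil sweeps. Placeholder kernels; no halo handling. */
    #include <stdlib.h>

    #define NX 64
    #define NY 64
    #define NZ 64
    #define TILE 8                                 /* tile height along k */
    #define IDX(i, j, k) ((((i) * NY) + (j)) * NZ + (k))

    static void fused_sweeps(const double *in, double *tmp, double *out)
    {
        /* Each thread (standing in for a "work team") owns a k-tile and
         * performs both sweeps on it before moving on, so tmp for that
         * tile stays in cache between the sweeps. */
        #pragma omp parallel for schedule(static)
        for (int kt = 1; kt < NZ - 1; kt += TILE) {
            int kend = (kt + TILE < NZ - 1) ? kt + TILE : NZ - 1;

            /* First sweep: 3-point average along i (placeholder kernel). */
            for (int i = 1; i < NX - 1; ++i)
                for (int j = 1; j < NY - 1; ++j)
                    for (int k = kt; k < kend; ++k)
                        tmp[IDX(i, j, k)] = (in[IDX(i - 1, j, k)] + in[IDX(i, j, k)]
                                             + in[IDX(i + 1, j, k)]) / 3.0;

            /* Second sweep, fused into the same tile: 3-point average along j. */
            for (int i = 1; i < NX - 1; ++i)
                for (int j = 1; j < NY - 1; ++j)
                    for (int k = kt; k < kend; ++k)
                        out[IDX(i, j, k)] = (tmp[IDX(i, j - 1, k)] + tmp[IDX(i, j, k)]
                                             + tmp[IDX(i, j + 1, k)]) / 3.0;
        }
    }

    int main(void)
    {
        size_t n = (size_t)NX * NY * NZ;
        double *in  = calloc(n, sizeof *in);
        double *tmp = calloc(n, sizeof *tmp);
        double *out = calloc(n, sizeof *out);
        fused_sweeps(in, tmp, out);
        free(in); free(tmp); free(out);
        return 0;
    }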

    Large-Scale Parallelization of Human Migration Simulation

    Forced displacement of people worldwide, for example due to violent conflicts, is common in the modern world: today more than 82 million people are forcibly displaced. This puts migration at the forefront of the most important problems facing humanity. The Flee simulation code is an agent-based modeling tool that can forecast population displacements in civil-war settings, but performing accurate simulations requires non-negligible computational capacity. In this article, we present our approach to parallelizing Flee for fast execution on multicore platforms, and discuss the computational complexity of the algorithm and its implementation. We benchmark the parallelized code on supercomputers equipped with AMD EPYC Rome 7742 and Intel Xeon Platinum 8268 processors, and investigate its performance across a range of alternative rule sets, different refinements of the spatial representation, and various numbers of agents representing displaced persons. We find that Flee scales excellently up to 8,192 cores for large cases, although very detailed location graphs can impose a large initialization time overhead.
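
    As a toy analogue of the per-agent parallelism such a code can exploit, the following C/OpenMP sketch moves independent agents over a tiny hypothetical location graph and parallelizes the per-step agent loop across threads. The graph, move probability, and movement rule are invented for illustration only; the actual Flee implementation, and the distributed-memory parallelization benchmarked in the article, is a separate code base.

    /* Hedged sketch: per-step, per-agent parallel loop over a toy location
     * graph. Placeholder movement rule; not the Flee rule set. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N_LOCATIONS 4
    #define N_AGENTS    100000
    #define N_STEPS     10

    /* Hypothetical adjacency matrix: links[a][b] != 0 means a -> b is reachable. */
    static const int links[N_LOCATIONS][N_LOCATIONS] = {
        {0, 1, 1, 0},
        {0, 0, 1, 1},
        {0, 0, 0, 1},
        {0, 0, 0, 0},                 /* location 3 acts as a camp: no exits */
    };

    int main(void)
    {
        int *agent_loc = malloc(N_AGENTS * sizeof(int));
        for (int a = 0; a < N_AGENTS; ++a)
            agent_loc[a] = 0;         /* all agents start at location 0 */

        for (int step = 0; step < N_STEPS; ++step) {
            /* Within a step, agents decide independently, so the loop
             * parallelizes without synchronization. */
            #pragma omp parallel for schedule(static)
            for (int a = 0; a < N_AGENTS; ++a) {
                /* Tiny per-agent LCG so threads need no shared RNG state. */
                unsigned int s = (unsigned int)(step * N_AGENTS + a) * 1664525u + 1013904223u;
                int loc = agent_loc[a];
                if (s % 100 < 30) {                /* 30% move chance (placeholder rule) */
                    s = s * 1664525u + 1013904223u;
                    int target = (int)(s % N_LOCATIONS);
                    if (links[loc][target])
                        agent_loc[a] = target;
                }
            }
        }

        /* Per-location head count after the last step. */
        int counts[N_LOCATIONS] = {0};
        for (int a = 0; a < N_AGENTS; ++a)
            counts[agent_loc[a]]++;
        for (int l = 0; l < N_LOCATIONS; ++l)
            printf("location %d: %d agents\n", l, counts[l]);

        free(agent_loc);
        return 0;
    }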